First, we must understand how the unemployment rate is defined in order to see how it is derived and what factors relate to it. In the United States, a person is counted as unemployed if they are jobless, have looked for work in the last four weeks, and are available for work. To measure unemployment, the government surveys a sample of the population and uses it to estimate the total number of unemployed people in a broader area. Measuring unemployment gives us a good overview of the ongoing state of the economy, international competition, technological development, and so on.
The unemployment rate is given by
\[\text{unemployment rate} = \frac{\text{unemployed}}{\text{labor force}}\times100\] where the labor force includes everyone age 16 and older who is classified as either employed or unemployed.
In this project, I will focus on analyzing and predicting the unemployment rate in LA County.
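As a quick check of the formula, here is a small numeric example (the figures are hypothetical, chosen only for illustration):

```r
# hypothetical figures for illustration only
unemployed  <- 275000
labor_force <- 5000000

# unemployment rate = unemployed / labor force * 100
unemployment_rate <- unemployed / labor_force * 100
unemployment_rate # 5.5 (percent)
```

So 275,000 unemployed people in a labor force of 5,000,000 corresponds to a rate of 5.5%, on the same scale as the LA County series analyzed below.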
unprocessed = read.csv("D:/UCSB/Spring_2022/PSTAT 131/PSTAT_131_HW/HW2/PSTAT-131/Final Project/data/unprocessed_data.csv")
head(unprocessed)
## DATE unemploy_rate_la avg_price_pipedgas_la avg_price_electr_kwh_La
## 1 1990/1/1 5.9 18.662 0.105
## 2 1990/2/1 5.6 19.971 0.108
## 3 1990/3/1 5.4 19.971 0.108
## 4 1990/4/1 5.5 19.976 0.108
## 5 1990/5/1 5.4 27.721 0.107
## 6 1990/6/1 5.4 27.712 0.108
## avg_price_gasolone_la civilian_labor_force_la_pch cpi_allitems_la
## 1 0.957 #N/A 1.1
## 2 0.988 0.2 1.1
## 3 1.014 0.1 0.7
## 4 1.030 -0.4 -0.2
## 5 1.080 0.4 0.3
## 6 1.103 0 0.3
## economics_cond_index_la unemployed_num_pch
## 1 NA #N/A
## 2 1.24 -4.3
## 3 0.58 -3.8
## 4 -0.35 1
## 5 -0.42 -0.9
## 6 -0.64 -0.5
## new_private_housing_structure_issue_la home_price_index_la
## 1 5121.609 0.7
## 2 4648.972 0.4
## 3 3628.443 0.2
## 4 3833.476 -0.1
## 5 3466.321 -1.1
## 6 3496.684 -0.3
## allemployee_nonfarm_la_pch allemployee_constr_la_pch allemployee_manu_la_pch
## 1 #N/A #N/A #N/A
## 2 0.1 -1.2 0
## 3 -0.1 -1.1 -0.4
## 4 -0.2 -3.7 -0.4
## 5 -0.2 -1.3 -0.4
## 6 -0.3 -0.9 -0.7
## allemployee_finan_la_pch allemployee_leisure_la_pch new_patent_ass_la
## 1 #N/A #N/A 202
## 2 -0.1 -0.4 218
## 3 -0.7 -0.1 290
## 4 -0.2 -0.6 256
## 5 -0.8 -0.1 262
## 6 -0.3 0.3 242
## govn_social_insu_pch compen_employee_wage_pch real_disp_inc_per_capital_pch
## 1 1.6 0.5 0.5
## 2 0.0 1.2 0.1
## 3 1.0 0.7 -0.1
## 4 0.0 0.9 0.5
## 5 0.3 -0.3 -0.2
## 6 1.3 0.8 0.0
## bbk_real_gdp us_interest_rate pers_consum_expen_pch pers_saving_rate
## 1 4.8776608 7 1.3 8.0
## 2 6.1814509 7 -0.1 8.6
## 3 2.9195562 7 0.7 8.3
## 4 -0.5634379 7 0.4 8.8
## 5 0.7507924 7 0.2 8.7
## 6 1.1771073 7 0.8 8.6
## pers_current_tax_chg govn_social_ben_toperson_pch federal_fund_eff_rate
## 1 -8.4 5.3 8.229032
## 2 8.1 -0.2 8.237143
## 3 5.2 0.7 8.276774
## 4 4.2 0.6 8.255000
## 5 0.6 -0.5 8.176452
## 6 3.9 1.2 8.288667
## X30_year_fixed_mortgage
## 1 1.54967
## 2 3.05710
## 3 0.69135
## 4 0.99338
## 5 1.03664
## 6 -2.99213
Although all variables are supposed to be numeric, some of them are imported as character variables (note the "#N/A" entries above). We also need to deal with missing values in the columns whose names contain "pch", which stands for "percent change".
unprocessed = unprocessed[-1,] # delete the first row, which contains NA values
date = unprocessed[,1]
unprocessed = unprocessed[,-1] %>% mutate_if(is.character, as.numeric)
unprocessed$DATE = date
unprocessed = unprocessed %>%
  select(DATE, everything())
# delete variables that have at least 8 missing values
processed_data = unprocessed %>% select(-avg_price_pipedgas_la, -new_patent_ass_la,
-cpi_allitems_la, -us_interest_rate,
-economics_cond_index_la) %>%
head(-2)
sapply(processed_data, function(x) sum(is.na(x)))
## DATE unemploy_rate_la
## 0 0
## avg_price_electr_kwh_La avg_price_gasolone_la
## 0 0
## civilian_labor_force_la_pch unemployed_num_pch
## 0 0
## new_private_housing_structure_issue_la home_price_index_la
## 0 0
## allemployee_nonfarm_la_pch allemployee_constr_la_pch
## 0 0
## allemployee_manu_la_pch allemployee_finan_la_pch
## 0 0
## allemployee_leisure_la_pch govn_social_insu_pch
## 0 0
## compen_employee_wage_pch real_disp_inc_per_capital_pch
## 0 0
## bbk_real_gdp pers_consum_expen_pch
## 0 0
## pers_saving_rate pers_current_tax_chg
## 0 0
## govn_social_ben_toperson_pch federal_fund_eff_rate
## 0 0
## X30_year_fixed_mortgage
## 0
write.csv(processed_data, "D:\\UCSB\\Spring_2022\\PSTAT 131\\PSTAT_131_HW\\HW2\\PSTAT-131\\Final Project\\data\\processed_data.csv", row.names = FALSE)
# start in Feb 1990 since we deleted the first row of the unprocessed data
processed = ts(unprocessed[,-1],frequency = 12, start = c(1990,2))
autoplot.zoo(processed[,"unemploy_rate_la"])+
ggtitle("Unemployment Rate in LA County") +
xlab("Year") +
ylab("Percent (%)")
The plot shows several well-known recessions from 1990 to the present. The collapse of the internet bubble, the financial crisis of 2007, and the COVID-19 pandemic each coincide with a sharp increase in the unemployment rate. We may study how different economic indexes fluctuated during the 2007 financial crisis, which was mainly caused by mortgage debt.
Now we want to examine the seasonal pattern of the unemployment rate. Apart from the abnormal fluctuation during the COVID-19 pandemic, we find that the unemployment rate usually peaks in summer. Why this happens is a question that requires further research.
ggseasonplot(processed[,"unemploy_rate_la"]) +
ggtitle("Seasonal Plot of Unemployment Rate in LA County") +
xlab("Year") +
ylab("Percent (%)")
ggsubseriesplot(processed[,"unemploy_rate_la"]) +
ylab("Percent (%)") +
ggtitle("Seasonal subseries plot: LA Unemployment Rate")
Net Migration
census_api_key("7540e4d61b8467521425225cbe8f44f7c1667f9a")
net_migration <- get_estimates(geography = "county", state = "CA",
variables = "RNETMIG",
year = 2019,
geometry = TRUE,
resolution = "20m") %>%
shift_geometry()
order = c("-15 and below", "-15 to -5", "-5 to +5", "+5 to +15", "+15 and up")
net_migration <- net_migration %>%
mutate(groups = case_when(
value > 15 ~ "+15 and up",
value > 5 ~ "+5 to +15",
value > -5 ~ "-5 to +5",
value > -15 ~ "-15 to -5",
TRUE ~ "-15 and below"
)) %>%
mutate(groups = factor(groups, levels = order))
state_overlay <- states(
cb = TRUE,
resolution = "20m"
) %>%
filter(GEOID != "72") %>%
shift_geometry()
ggplot() +
geom_sf(data = net_migration, aes(fill = groups, color = groups), size = 0.1) +
scale_fill_brewer(palette = "PuOr", direction = -1) +
scale_color_brewer(palette = "PuOr", direction = -1, guide = "none") +
labs(title = "Net migration per 1000 residents in CA",
subtitle = "US Census Bureau 2019 Population Estimates",
fill = "Rate") +
theme_minimal(base_family = "Roboto")
We find that LA County has been losing population over the past five years. How this trend will affect the unemployment rate requires further discussion.
Median Age
#median age
med_age <- get_acs(state = "CA", county = "Los Angeles", geography = "tract",
variables = "B01002_001", geometry = TRUE)
med_age %>%
ggplot(aes(fill = estimate)) +
geom_sf(color = NA) +
scale_fill_viridis_c(option = "magma")
model_data = read.csv("D:/UCSB/Spring_2022/PSTAT 131/PSTAT_131_HW/HW2/PSTAT-131/Final Project/data/processed_data.csv")
I first use data from 1990–2016 to test whether models that use only previous values of the unemployment rate and the date can forecast with satisfactory accuracy. Beyond that, I want to use models that take other economic indicators into account. Hopefully, I can build models that first gauge the economic situation in the US and then determine the trend of the unemployment rate.
Good forecasts capture the genuine patterns and relationships which exist in the historical data, but do not replicate past events that will not occur again. When forecasting time series data, the aim is to estimate how the sequence of observations will continue into the future. Therefore, the main concern may be only to predict what will happen, not to know why it happens.
data = model_data %>%
  select(DATE, unemploy_rate_la) %>%
  mutate(DATE = as.Date(DATE))
data = data[1:320,] # exclude the pandemic period
data %>% plot_time_series(DATE, unemploy_rate_la)
Splitting the data set and creating the training and testing set
splits <- initial_time_split(data, prop = 0.9)
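The six model objects passed to `modeltime_table()` below are fit beforehand; that code is not shown in this chunk. As a representative sketch only (assuming the standard modeltime workflow and the `splits` object above; the specific engines shown are assumptions, not the exact specifications used), three of the non-boosted models might be fit like this:

```r
library(tidymodels)
library(modeltime)

# ARIMA without boosting, fit on the training portion of the split
model_fit_arima_no_boost <- arima_reg() %>%
  set_engine("auto_arima") %>%
  fit(unemploy_rate_la ~ DATE, data = training(splits))

# Exponential smoothing (ETS)
model_fit_ets <- exp_smoothing() %>%
  set_engine("ets") %>%
  fit(unemploy_rate_la ~ DATE, data = training(splits))

# Prophet
model_fit_prophet <- prophet_reg() %>%
  set_engine("prophet") %>%
  fit(unemploy_rate_la ~ DATE, data = training(splits))
```

Each fit is a parsnip model object, which is why they can be collected directly into a `modeltime_table()`.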
Add fitted models to a Model Table
models_tbl <- modeltime_table(
model_fit_arima_no_boost,
model_fit_arima_boosted,
model_fit_ets,
model_fit_prophet,
model_fit_lm,
wflw_fit_mars
)
models_tbl
## # Modeltime Table
## # A tibble: 6 x 3
## .model_id .model .model_desc
## <int> <list> <chr>
## 1 1 <fit[+]> ARIMA(4,0,1)(2,1,1)[12]
## 2 2 <fit[+]> ARIMA(2,0,2)(1,1,1)[12] W/ XGBOOST ERRORS
## 3 3 <fit[+]> ETS(A,AD,A)
## 4 4 <fit[+]> PROPHET
## 5 5 <fit[+]> LM
## 6 6 <workflow> EARTH
Calibrate the model to a testing set
calibration_tbl <- models_tbl %>%
modeltime_calibrate(new_data = testing(splits))
calibration_tbl
## # Modeltime Table
## # A tibble: 6 x 5
## .model_id .model .model_desc .type .calibration_da~
## <int> <list> <chr> <chr> <list>
## 1 1 <fit[+]> ARIMA(4,0,1)(2,1,1)[12] Test <tibble>
## 2 2 <fit[+]> ARIMA(2,0,2)(1,1,1)[12] W/ XGBOOS~ Test <tibble>
## 3 3 <fit[+]> ETS(A,AD,A) Test <tibble>
## 4 4 <fit[+]> PROPHET Test <tibble>
## 5 5 <fit[+]> LM Test <tibble>
## 6 6 <workflow> EARTH Test <tibble>
calibration_tbl %>%
modeltime_forecast(
new_data = testing(splits),
actual_data = data
) %>%
plot_modeltime_forecast(
.legend_max_width = 25
)
calibration_tbl %>%
modeltime_accuracy() %>%
table_modeltime_accuracy(
.interactive = FALSE
)
Accuracy Table

| .model_id | .model_desc | .type | mae | mape | mase | smape | rmse | rsq |
|---|---|---|---|---|---|---|---|---|
| 1 | ARIMA(4,0,1)(2,1,1)[12] | Test | 0.73 | 11.89 | 2.19 | 11.09 | 0.85 | 0.64 |
| 2 | ARIMA(2,0,2)(1,1,1)[12] W/ XGBOOST ERRORS | Test | 0.72 | 12.68 | 2.18 | 11.22 | 1.01 | 0.67 |
| 3 | ETS(A,AD,A) | Test | 0.60 | 10.02 | 1.82 | 9.35 | 0.74 | 0.85 |
| 4 | PROPHET | Test | 6.37 | 100.78 | 19.17 | 64.20 | 6.60 | 0.52 |
| 5 | LM | Test | 2.35 | 38.82 | 7.07 | 30.58 | 2.67 | 0.00 |
| 6 | EARTH | Test | 0.64 | 10.57 | 1.94 | 9.80 | 0.75 | 0.94 |
It seems that the ETS(A,Ad,A) model performs best on the test set by most error metrics (MAE 0.60, RMSE 0.74), with the EARTH model close behind and achieving the highest R² (0.94), while the PROPHET and LM models perform poorly on this series.